KEP: Coscheduling #2337

Closed
wants to merge 1 commit into from

Conversation

k82cn
Member

@k82cn k82cn commented Jul 3, 2018

Signed-off-by: Da K. Ma klaus1982.cn@gmail.com

/cc @vishh, @bsalamat, @mindprince, @jlewi

/cc @kubernetes/sig-scheduling-feature-requests

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 3, 2018
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jul 3, 2018

## References

* [Gang scheduling in Kuberentes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
Contributor

typo Kuberentes /Kubernetes

Member

@bsalamat bsalamat left a comment

Thanks, @k82cn and sorry for the delay in reviewing the PR.


## Motivation

After the discussion at the [Gang-scheduling](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv) proposal, we decided to make the Gang-scheduling API out of core by CRD, and implement it in [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator). kube-arbitrator focus on "batch" workload in kubernetes, and will share the same [scheduling frameworks](https://github.com/kubernetes/community/pull/2281) when it's ready. This document provides the definition of the CRD for gang-scheduling, and suggestions for customized job controllers.
Member

s/focus/focuses/

Contributor

I'd recommend focusing on one feature at a time. If that's gang scheduling, just talk about how k8s would enable it either via default scheduler or a new scheduler under incubation.


### CRD Definition

Although workload lifecycle requirements maybe different, e.g. MPI vs. Spark, the requirements to scheduler are similar. To meet both scheduling and lifecycle requirements, two **Kinds** are going to be defined. One new **Kind**, named `SchedulingSpec`, is defined as follow for scheduler:
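For orientation while reading the inline fragments below, here is a rough sketch of what such a `SchedulingSpec` **Kind** could look like; it is assembled from the field fragments quoted in this review (`Priority`, `NodeSelector`, `MinAvailable`) and is illustrative rather than the KEP's exact definition:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative sketch only: the field set is assembled from the fragments
// quoted in the review comments below, not the authoritative KEP text.
type SchedulingSpec struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	Spec SchedulingSpecTemplate
}

type SchedulingSpecTemplate struct {
	// Priority of the 'Job'; resolved from PriorityClassName.
	// The higher the value, the higher the priority.
	// +optional
	Priority *int32

	// NodeSelector must match a node's labels for the pods of the 'Job'
	// to be scheduled on that node.
	// +optional
	NodeSelector map[string]string

	// MinAvailable is the minimal number of tasks that must be able to
	// run together for the 'Job' to be started.
	MinAvailable int32
}
```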
Member

s/maybe/may be/
s/kind/type/ ?

// this field from PriorityClassName.
// The higher the value, the higher the priority.
// +optional
Priority *int32
Member

I think you should remove this field from the template. Priority is supposed to be resolved at the time of Pod creation. It exists in PodSpec, but it doesn't make sense to be in the template. This field is not expected to be copied to PodSpec.

Member Author

Reusing Priority for the "Job" here; the priority of a Pod will only take effect within a "Job".

There are requirements for Job-level priority, IMO: kubernetes/kubernetes#65746

// fit on a node. Selector which must match a node's labels for the pod of
// 'Job' to be scheduled on that node.
// +optional
NodeSelector map[string]string
Member

NodeSelector is in a way an earlier and inferior version of NodeAffinity. Why not use NodeAffinity instead of NodeSelector?

Member Author

Agree, NodeAffinity is better; I used NodeSelector to keep the prototype simple :)

}
```

The `SchedulingSpec` defines necessary parameters of Job for scheduler, e.g. priority of Job, minimal available tasks of Job. `SchedulingSpec` does not include lifecycle parameters which are managed by following new **Kind**, named `QueueJob`:
Member

s/defines necessary parameters of Job for scheduler/specifies the scheduling constraints of a Job/
s/minimal/minimum/


// The minimal available pods to run for this QueueJob
// +optional
MinAvailable int32
Member

Given that this type shows status of a job, shouldn't this field be NumAvailable?


### kar-scheduler (scheduler component of kube-arbitrator)

The scheduler only watches `SchedulingSpec` and `Pod`; it'll reconstruct 'Job' by the `OwnerReference` of the controller (`OwnerReference.Controller == true`), and the `Pod`s are consider as 'Task' of 'Job'. That requires the controller to set the `OwnerReference` correctly. Different from other components, `kar-scheduler` does not use `LabelSelector` to reconstruct the job because of performance concerns.
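As an illustration of the reconstruction step, a rough sketch (not the actual kube-arbitrator code) of grouping pods into 'Jobs' by their controlling `OwnerReference`:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupPodsByController buckets pods by the UID of their controlling owner
// (the OwnerReference with Controller == true); each bucket corresponds to
// one reconstructed 'Job'. Pods without a controller are skipped.
func groupPodsByController(pods []*v1.Pod) map[types.UID][]*v1.Pod {
	jobs := map[types.UID][]*v1.Pod{}
	for _, pod := range pods {
		if owner := metav1.GetControllerOf(pod); owner != nil {
			jobs[owner.UID] = append(jobs[owner.UID], pod)
		}
	}
	return jobs
}
```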
Member

s/consider/considered/


2. `Number of Running pods` >= `.spec.SchedSpec.minAvailable`: If there are still enough running pods for the `QueueJob`, the `QueueJobController` will only recreate failed pods, and wait for scheduler to bind the pending pods.

* **QueueJob Update/Delete**: There is no rolling update for `QueueJob`; if the pod template is updated, the whole `QueueJob` is re-created. If a `QueueJob` is deleted, all its pods and its `SchedulingSpec` will be cleaned up.
Member

Do we plan to support running long-running services as a gang (QueueJob) as well? If we don't, we may not need to support rolling updates for batch jobs, which are expected to run to completion.

Member Author

Do we plan to support running long-running services as a gang (QueueJob) as well?

Good question! I originally planned to support long-running services too, but there is no concrete requirement for that right now. So I'd like to make it batch-only at first to keep the first release simple.

/cc @mindprince , fyi, some changes since we last synced up.


* **Pod Failures/Deletion/Unschedulable**: When a pod fails, is deleted, or is unschedulable, `QueueJobController` manages `QueueJob`'s lifecycle according to its status:

1. `Number of Running pods` < `.spec.SchedSpec.minAvailable` : If the number of running pods is less than `.spec.SchedSpec.minAvailable`, the `QueueJobController` will kill the whole QueueJob and recreate pods for rescheduling in batch.
Member

How do you deal with the last set of running jobs? For example, let's say that minAvailable is 20. Total number of tasks in the QueueJob is 100. 85 of them have already run and finished. 15 of them are still running. 15 is less than minAvailable, but we should let them run.

Member Author

Yes; it needs to include Succeeded pods: Succeeded + Running >= minAvailable
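A tiny sketch of the check being discussed (a hypothetical helper, counting `Succeeded` pods toward the gang):

```go
// shouldRestartJob sketches the rule discussed here: the whole QueueJob is
// torn down for batch rescheduling only when even Succeeded + Running pods
// cannot reach minAvailable. Hypothetical helper, not the controller's code.
func shouldRestartJob(succeeded, running, minAvailable int32) bool {
	return succeeded+running < minAvailable
}
```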

}
```

`QueueJob` is the default **Kind** for batch jobs; it includes multiple types of tasks according to the requirements of ML frameworks, e.g. TensorFlow. `QueueJob` also uses `SchedulingSpec` to describe its requirements to the scheduler. The `QueueJob` is managed by `QueueJobController` in `kar-controllers`; the detailed function is described in the following section.
Member

You haven't described users' workflow here, but I guess users only need to create a QueueJob and then the kar-controllers create Pods and SchedulingSpec. Is that right?
If so, why do we need SchedulingSpec? QueueJob has SchedulingSpecTemplate already and kar-scheduler can get MinAvailable from there. Other fields of SchedulingSpecTemplate will be available in the PodSpec of each Pod created for a QueueJob.

Member Author

You haven't described users' workflow here, but I guess users only need to create a QueueJob and then the kar-controllers create Pods and SchedulingSpec. Is that right?

That's right !

If so, why do we need SchedulingSpec?

SchedulingSpec is used by customized controllers, e.g. kubeflow/tf-operator. I do suggest using QueueJob, but I don't want to make it a restriction currently. For example, kubeflow/tf-operator can use this by replacing PDB with SchedulingSpec.

![queuejob](images/gang-scheduling-queuejob.png)


### Customized controller
Member

We also need to cover admission in this document. I think we ideally want to admit either the whole gang or none of the members of a gang. Doing so looks challenging to me with the existing admission controllers. Do you have any ideas on addressing this issue?

Contributor

Transactional multiobject admission is not possible.

Contributor

Would this mean that we have to necessarily have a wrapper (maybe QueueJob) that will allow for admission of a gang of pods in one shot?

Member

by the time scheduling happens, the pod objects have already been admitted, which means object counts, resource quotas, and pod spec restrictions have already been enforced... if the gang scheduler is just matching up pods to nodes, it doesn't seem like admission would cause a lot of problems

Would this mean that we have to necessarily have a wrapper (maybe QueueJob) that will allow for admission of a gang of pods in one shot?

no, you can't end-run around individual pod admission via another object

Member Author

I think we ideally want to admit either the whole gang or none of the members of a gang.

It's better to delegate this to controllers or applications; it's hard for admission to collect "dynamic info", e.g. for ResourceQuota it's hard for admission to decide which pods should be created to satisfy the 'Gang/PodSchedulingGroup' as much as possible. In the google doc, the QueueJob will stop creating new Pods if rejected (there are several corner cases to cover)

Member

In the google doc, the QueueJob will stop creating new Pods if rejected (there are several corner cases to cover)

The problem is that the user experience would be poor. A few pods of a gang are created, but the rest fail admission and are not created. Those pods that are created will not be scheduled, but will consume the user's quota. I guess the user has to delete them manually. If we cannot have transactional admission, at least we should have a controller clean up such pods.

Member Author

@k82cn k82cn Jul 22, 2018

A few pods of a gang are created, but the rest fail admission and are not created.

For batch or run-to-complete workloads, it's better to have some pending jobs there, so k8s can start the pending jobs immediately when some running jobs finish, instead of waiting for the client to submit new jobs; hmm...., a kind of pipeline. The difference is that
the "controller" has to make sure not to introduce deadlock (e.g. for every pod group, starting only part of it); only the "last" pod group should be impacted.

For example, in the google doc, the QueueJobController will create pods one by one based on the QueueJob's minAvailable; if minAvailable pods are not created, it will NOT create pods for the next QueueJob. If all minAvailable requirements are met, QueueJobController will create pods up to .spec.replicas in round-robin for each QueueJob. There are still some cases that are not covered, e.g. updating minAvailable for a better SLA, a new high-priority job, or starting other jobs if minAvailable cannot be met. That's another reason I prefer to focus on batch/run-to-complete first and introduce QueueJob; I'm also OK to enhance Job to support those cases in future, and maybe Queue.

Currently, AdmissionController checks pods based on a 'snapshot' of status/config, e.g. ResourceQuota. But for gang scheduling, it also needs to check whether the controller WILL create more pods later to meet the gang requirement. And a wrapper object cannot help as some AdmissionControllers are still doing pod-level checks.

Another corner case: a huge job (e.g. 1000+ tasks) waits for long-running run-to-complete jobs (e.g. hours or days) :(
The candidate solution is to support Queue, and balance resource sharing between Queues. This is also in the backlog.


## Function Detail

### CRD Definition
Member

I have a slightly different idea about the API. Can we create a namespaced object, let's call it a gang, that only specifies properties of a gang. Something like:

type Gang struct {
    metav1.TypeMeta
    metav1.ObjectMeta

    NumMembers int
    Policy GangLifeCyclePolicy
}

Then Pod.PodSpec can have a field called GangName or GangId. When the field is not empty, it means that the pod is a gang member. When scheduling pods which are gang members, the scheduler makes sure that at least Gang.NumMembers of such pods can be scheduled simultaneously or none of them are scheduled.

The Gang type can also include gang life cycle policies. For example, the policy can specify that when the number of running pods drops below NumMembers, all the other instances are killed, or that the controller waits a certain amount of time for the gang members to reach NumMembers before killing the other instances, ....
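For illustration, a hypothetical sketch of what such a GangLifeCyclePolicy could look like; the proposal does not spell it out, so every field name here is invented:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical sketch only; not part of the proposal above.
type GangLifeCyclePolicy struct {
	// Action to take when the number of running members drops below
	// NumMembers, e.g. "KillAll" (tear down the remaining members) or
	// "None" (leave them running).
	Action string

	// GracePeriod is how long to wait for the gang to recover to
	// NumMembers before taking Action; nil/zero means act immediately.
	// +optional
	GracePeriod *metav1.Duration
}
```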

Here are some of the benefits of this design:

  1. Existing workloads that use one of the K8s collections, e.g. ReplicaSet, Job, etc., can keep their config and only add the GangName to their pod template.
  2. Flexibility. Pods of a gang can have different PodSpec and belong to different collections.
  3. Existing controllers which are responsible for managing life cycle of collections work well with this design.
  4. Users who have already learned and worked with K8s collections can keep using the familiar collections.

Contributor

This sounds like something that should be a label before we create a field on pod spec (also gang wouldn’t be a great name for an api object).

Contributor

We need a way to identify a gang, where labels could suffice in theory. We also need to know how many pods comprise a gang (or the minimum size of a gang). Encoding that value in a label isn't ideal.
I'd recommend modifying @bsalamat's proposal slightly by including a label selector to identify pods belonging to a gang.

+1 for aiming to have compatibility with existing controllers, Job API at minimum.

Member Author

@k82cn k82cn Jul 20, 2018

I used to propose PodSchedulingGroup in the google doc, so I totally agree with you on providing such a kind of API object in core.

Regarding label selector and 'GangID/GroupID', there are some cases we need to handle:

  • race condition: when a scheduler gets a Pod w/o 'Gang/PodSchedulingGroup', how should it schedule it?
    • If it schedules the pod and no 'Gang/PodSchedulingGroup' comes, that's OK
    • If it schedules the pod, and the 'Gang/PodSchedulingGroup' comes later and meets 'minAvailable', that's OK
    • If it schedules the pod, and the 'Gang/PodSchedulingGroup' comes later but with fewer pods than 'minAvailable':
      • if it's because of controller delay, the scheduler should wait
      • if there are no more pods, the scheduler should handle it by LifeCyclePolicy
    • If it does not schedule the pod, how long should we wait?
  • performance: for a batch or run-to-complete workload, pods come and go in a short time; it may take time to
    build the relationship between 'Gang/PodSchedulingGroup' and Pod
  • summary/status: if 'Gang/PodSchedulingGroup' supports crossing Kinds/Types, it's better to provide a way to
    show the summary/status, such as succeeded, running, pending

Here's what I did in kube-arbitrator:

  • race condition: as a batch scheduler, kube-arbitrator does scheduling based on 'job'; so it will not schedule pods
    until it gets the 'Gang/PodSchedulingGroup'
  • performance: using OwnerReference (controller==true); so 'GangID/GroupID' will be better for performance. 'label selector' is a
    kind of short-term solution to avoid touching more Kinds/Types
  • summary/status: it delegates to the 'Gang/PodSchedulingGroup' owner to handle summary/status

Another minor comment is the name: as you know, gang-scheduling is one of the requirements, and we may enhance this object for other requirements later; so a more
general name may be better, also for GangLifeCyclePolicy, e.g. changing it to PodGroupLifeCyclePolicy. And I suppose LifeCyclePolicy is going to be handled by kube-arbitrator.

Member

@bsalamat bsalamat Jul 20, 2018

I totally agree with @k82cn on label selector. I think a pod should identify itself as a gang member, not the label selector of a gang choosing pods. Klaus provided the reasons, so I don't repeat them again.
(Feel free to change the name of the object. I agree that Gang is not a good name for our API. I just used it to present the idea.)

Contributor

+1 to general point raised by @bsalamat here!

For naming, how about something simple like SchedulingGroup? Then the gang-ness can be specified in the policy.


```go
// QueueJob is a Kind for batch workload.
type QueueJob struct {
Contributor

If you were to consider the approach @bsalamat suggested, we would not need a queue job I think.

Member Author

Yes; if so, I'd like to enhance 'Job' instead of introducing a new Kind.

Member Author

For example, multiple pod templates (or a wrapper of multiple jobs).

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jul 26, 2018
@k82cn k82cn changed the title Design document of Gang scheduling KEP: Gang scheduling Jul 26, 2018
@k82cn
Member Author

k82cn commented Jul 26, 2018

@bsalamat @vishh , updated the doc according to comments, PTAL :)

  1. Introduce PodSchedulingGroup into core
  2. Added feature interactions section, e.g. Admission

@vishh
Contributor

vishh commented Jul 26, 2018 via email

@k82cn
Member Author

k82cn commented Jul 26, 2018 via email

@ConnorDoyle
Contributor

Is that meeting open to the public? (Asking for an invite)

// MinAvailable defines the minimal available tasks to run the Job;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinAvailable int
Member Author

s/MinAvailable/NumMember/g

Contributor

NumPods?


Minimum? It's likely rare to set less than Replicas, so this should be optional for simplicity.

Member

@beberg it may be needed in many cases for a Job or ReplicaSet, etc. to have more replicas than those needed to be run together.

Member Author

It's likely rare to set less than Replicas, so this should be optional for simplicity.

PodGroup is a common concept for the "batch" feature; other Kinds/types can also use it.

}

// PodSchedulingGroupTemplate represents the template of a pod group.
type PodSchedulingGroupTemplate struct {
Contributor

s/Template/Spec

Member Author

Done :)

| Unschedulable | Restart | 1.12 | The `PodSchedulingGroupController` deletes all `Running` Pods of `PodSchedulingGroup` with default grace period; it also records an event for `PodSchedulingGroup` when restarting. It's up to pod's controller to re-create new pods. |
| Unschedulable | None | 1.12 | The `PodSchedulingGroupController` does nothing |

There's a limitation: when setting the policy to `Action: Restart, Phase: Failed`, the `PodSchedulingGroup` may keep restarting in some cases, e.g. a bug in the application in the pod. The solution, e.g. `RetryCount`, will be proposed in a coming release.
@beberg beberg Aug 2, 2018

RetryCount is universal enough it should be specified now to prevent a proliferation of solutions. HPC workloads are prone to failures and checkpoint/restart mechanics are common. RestartPolicy Never leads to far too many failed jobs that could have been resumed automatically (overnight, weekends, ...), so PodSchedulingGroupTemplate.RetryCount is needed.

Member Author

We'll discuss this later; IMO, RetryCount is a common requirement, not only for batch.

// according to Pod's phase.
type LifeCyclePolicy struct {
// The action that will be taken to the PodSchedulingGroup according to
// Pod's phase. One of "Restart", "None".

Since the context and point of this API is gang scheduling, "Restart" is the more logical default here for both "PodFailed" and "Unschedulable". The none/none case can be done with existing APIs today. Would also like to know the use case in mind for restart/none and none/restart, as they seem to be non-gang scenarios covered by existing APIs.

Member Author

Would also like to know the use case in mind for restart/none and none/restart

What does "restart/none and none/restart" mean?


```go
// PodSchedulingGroup defines the scheduling requirement of a pod group
type PodSchedulingGroup struct {
Member

What do you think about renaming this to just PodGroup?

Member Author

+1

Member Author

Done

// MinAvailable defines the minimal available tasks to run the Job;
// if there's not enough resources to start all tasks, the scheduler
// will not start anyone.
MinAvailable int
Member

@beberg it may be needed in many cases for a Job or ReplicaSet, etc. to have more replicas than those needed to be run together.


// The phase of pod; the controller takes actions according to this
// pod's phase. One of "PodFailed", "Unschedulable".
Event Event
Member

I think we should also add a Timeout field here. When Timeout is specified, the controller waits for the Group members to satisfy the Group requirements. If the timeout expires, it takes the Action.

Contributor

What does PodFailed mean in this context? Should this field instead be a Phase that indicates the state of the gang/group?

Member Author

we should also add a Timeout field here.

+1, and Done

What does PodFailed mean in this context? Should this field instead be a Phase that indicates the state of the gang/group?

That means at least one Pod of the Group has failed; PodGroupController needs to take action accordingly. I did not introduce a Phase for PodGroup as only the Pod's "controller" knows the phase; for example, for a batch job there may be Failed, Done, Suspended, Terminated and other phases; PodGroupController does not know about those :(.
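Putting the quoted fragments and the Timeout suggestion above together, a sketch of what the resulting LifeCyclePolicy could look like; the Timeout field and its type are assumptions:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Action and Event are defined elsewhere in the KEP; shown as plain strings
// here so the sketch is self-contained.
type Action string
type Event string

// Sketch assembled from the fragments and comments in this thread;
// illustrative only.
type LifeCyclePolicy struct {
	// The action taken on the PodGroup when Event occurs.
	// One of "Restart", "None".
	Action Action

	// The pod phase that the controller reacts to.
	// One of "PodFailed", "Unschedulable".
	Event Event

	// Timeout is how long the controller waits for the group to satisfy
	// its requirements before taking Action. Assumption, not KEP text.
	// +optional
	Timeout *metav1.Duration
}
```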

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 3, 2018
@k82cn
Member Author

k82cn commented Oct 11, 2018

@bsalamat , @vishh do you have more comments ? :)

Member

@bsalamat bsalamat left a comment

Thanks, Klaus. I think this is a good starting point. I have several minor comments. Otherwise, looks good.


As the lifecycle of Pods in a PodGroup may differ from one controller to another, the lifecycle of the members is not managed by the coscheduling feature. Each collection controller may implement or already have the means to manage the lifecycle of its members. The scheduler will record related events for the controller to manage pods, e.g. `Unschedulable`.

The update to `PodGroup` is not support for now; and deleting `PodGroup` does not impact Pod's status.
Member

nit: s/support/supported/

As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
Member

please replace numMembers with minMembers here and below.


As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
Member

It is not clear what this item means. What do you mean by "reconstructing job"?

}
```

The `PodGroup`, which is a namespaced object, specifies the attributes and status of a pod group, e.g. number of pods in a group. To define which pods are member of `PodGroup`, the following annotation key is introduced for `Pod`; the annotation key is used for this alpha feature, and it'll be changed to a more permanent form, such a field, when moing `PodGroup` to core.
Member

s/moing/moving/

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
3. In `allocate` phase, scheduler will
* record an `Unschedulable` event of `PodGroup` if some pods are running but `succeeded + pending + running < numMembers`, the controller takes action according to its configuration
Member

Today, we do not delete completed pods as long as their parent object (such as Job, ReplicaSet, etc) is not deleted. This behavior is essential for the proposed algorithm to work correctly.

(This is a note to ourselves. Nothing needs to be changed in the design.)

Contributor

Is the minMembers field in PodGroup expected to track the number of non-terminated pods or all pods that have been successfully admitted?

Member

Vish, I think "succeeded" means those pods that have run and finished successfully. This makes sense for "run-to-completion" pods.


### Multi-scheduler

Since multiple schedulers work in parallel, there may be decision conflicts between different schedulers, and the kubelet will reject one pod (as failed) in case of conflict. The controller will handle rejected pods based on its lifecycle policy for failed pods. It up to cluster admin to avoid such kind of conflict, e.g. node affinity which is supported by kube-arbitrator in 1.13.
Member

I would change the last sentence to
"Users and cluster admins may reduce the probability of such conflicts by partitioning the clusters logically, for example, by placing node-affinity to distinct set of nodes on various groups of pods."

2. Pods of a group/gang may have different `PodSpec` (and/or belong to different collections)
3. Existing controllers which are responsible for managing life cycle of collections work well with this feature

To meet the requirements above, the following **Kind** is introduced by CRD under `scheduling.incubator.k8s.io/v1alpha1` **Group**/**Version**.
Contributor

nit: Should it be incubator.scheduling.k8s.io? That way once it's out of incubation we only need to drop a sub-domain?

Member Author

Yours seems better :)


// The number of pods which reached phase Failed.
// +optional
Failed int32
Contributor

Why is tracking Succeeded and Failed useful at the pod Group level? This seems to make an assumption that pod groups are typically used only for run to completion jobs.

Member Author

It just gives a summary of pod status here; not only for run-to-completion jobs.

The `PodGroup`, which is a namespaced object, specifies the attributes and status of a pod group, e.g. number of pods in a group. To define which pods are member of `PodGroup`, the following annotation key is introduced for `Pod`; the annotation key is used for this alpha feature, and it'll be changed to a more permanent form, such a field, when moing `PodGroup` to core.

```go
scheduling.k8s.io/group-name
```
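For illustration, a minimal sketch (using client-go's core/v1 types) of how a controller might stamp this annotation onto every member pod; the helper name is hypothetical:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// tagPodWithGroup marks a pod as a member of the named PodGroup using the
// alpha annotation key above. Illustrative sketch only.
func tagPodWithGroup(pod *v1.Pod, groupName string) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["scheduling.k8s.io/group-name"] = groupName
}

// Usage (e.g. inside a controller building member pods):
//   tagPodWithGroup(pod, "qj-1")   // "qj-1" is just an example group name
```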
Contributor

nit: Mark it alpha explicitly & be consistent with the object kind - alpha.incubator.scheduling.k8s.io

Member Author

hm ...., I mentioned this alpha feature in the doc and will mention it in the release notes (in future); so when we migrate this feature to upstream, the user does not need to update the annotation in pods.


### Scheduling

The scheduler only watches `PodGroup` and `Pod`. It'll reconstruct 'Job' by the annotation of Pod and `PodGroup`; the `Pod`s are considered as 'Task's of the 'Job'; if annotation is empty, the scheduler records an unschedulable event of pod to ask user/controller to resubmit it. The scheduler does not schedule pods until its `PodGroup` is created.
Contributor

if annotation is empty, the scheduler records an unschedulable

So the batch scheduler that you intend to develop will not schedule pods that do not need gang scheduling?

Member Author

@k82cn k82cn Oct 12, 2018

Yes, for the first version. As we do not know when/whether the scheduler will get the PodGroup in future, the PodGroup is required currently, but the user can set minMember to 1.

I'm also thinking of adding another "flag" to identify whether a PodGroup is required to schedule the pod. Supporting jobs without a gang-scheduling PodGroup in kube-batch should be another proposal.

As the batch scheduler and the default scheduler may be running in parallel, the batch scheduler follows the multi-scheduler feature to only handle the `Pod`s that are submitted to it. The batch scheduler does scheduling as follows:

1. Reconstruct 'Job' by the annotation of `Pod` and `PodGroup`
2. If there are less Pods than `numMembers` of `PodGroup`, the 'job' will not be scheduled; and an unschedulable event of `Pod` will be recorded
Contributor

implementation note: Do not overwhelm the system with repeated unschedulable events.

Member Author

sure :)

* record an `Unschedulable` event of `PodGroup` if some pods are running but `succeeded + pending + running < numMembers`, the controller takes action according to its configuration
* allocate (but not bind) resource to Pods according to Pod's spec, e.g. `NodeAffinity`
* bind all Pods to hosts until the job is ready: if `numMembers` <= `allocated Pods` + `pending Pods`, it's ready when `numMembers` <= `allocated Pods`; otherwise, `numMembers` <= `allocated Pods` + `succeeded Pods`
4. If the scheduler cannot allocate enough resources to the job, the pods stay pending; and the resources cannot be allocated to other jobs
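The bind condition in step 3 above can be expressed as a small helper; a hypothetical sketch, not the kube-batch implementation:

```go
// jobReady sketches the readiness rule above: when allocated + pending pods
// can still reach numMembers, wait until numMembers pods are actually
// allocated; otherwise count pods that have already succeeded toward the gang.
func jobReady(numMembers, allocated, pending, succeeded int32) bool {
	if numMembers <= allocated+pending {
		return numMembers <= allocated
	}
	return numMembers <= allocated+succeeded
}
```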
Contributor

Two issues to consider:

  1. There is the risk of split brain where batch scheduler and the default scheduler might schedule onto the same resources which could in theory lead to one of the co-scheduled pods being rejected by the kubelet. What would you do if the batch scheduler cannot re-schedule that pod? Do we fall back on the controller to kill & resubmit the entire job? That feels like a sub-optimal solution from a user POV. Would it be possible to assign phantom pods to nodes to hold on to resources?
  2. Deadlock in the batch scheduler - If there are two jobs and both of them have half of their jobs allocated (not bound), then how will you make progress? Have you considered doing a transactional allocation - allocate all or free up allocated resources?

Member Author

Do we fall back on the controller to kill & resubmit the entire job? That feels like a sub-optimal solution from a user POV.

Yes, the controller will re-submit the entire job. Phantom pods may also be preempted by high-priority pods from the default scheduler. For such a mixed environment, I prefer to only share resources between elastic 'jobs', e.g. Spark's executors and nginx/tomcat.

Deadlock in the batch scheduler ... allocate all or free up allocated resources?

In kube-batch 0.1, we used this option; two major concerns: 1. there is extra computation here, pre-allocating and freeing up, 2. after freeing up, the resources may not be usable by others, e.g. because of predicates.
So in kube-batch 0.2/0.3, I'm thinking of using preemption (or backfill) to handle this case: kube-batch will try to allocate resources by each job's minMember as much as possible; if there are not enough resources, it first preempts allocated resources which are occupied by a pod group that cannot use them.

Member

So in kube-batch 0.2/0.3, I'm thinking of using preemption (or backfill) to handle this case: kube-batch will try to allocate resources by each job's minMember as much as possible; if there are not enough resources, it first preempts allocated resources which are occupied by a pod group that cannot use them.

What triggers this "preemption"? When there are two pod groups with the same priority and each one is partially allocated and there are no more resources in the cluster, which one preempts the other?

Member Author

Not sure which term is better: preempt or backfill; anyway, the assumed/allocated resources will be in the "Allocated" state if the job did not get enough resources; and after the allocation phase, those allocated-but-not-bound resources will be re-calculated for other jobs. For your case, the two jobs will try to get those allocated resources in order (e.g. FCFS).

Member

I am not sure if FCFS works in this case. What if the pods of the two jobs are created/processed alternately? For example, the batch scheduler processes pod 1 of job 1 and then pod 1 of job 2, then pod 2 of job 1, then pod 2 of job 2, and so on.

Member Author

@k82cn k82cn Oct 19, 2018

In kube-batch, allocation is done at the job/podgroup level (FCFS is also per job/podgroup); so for this case, we will not handle pod 2 of job 2 until all pods in job 1 are handled :)

* bind all Pods to hosts until the job is ready: if `numMembers` <= `allocated Pods` + `pending Pods`, it's ready when `numMembers` <= `allocated Pods`; otherwise, `numMembers` <= `allocated Pods` + `succeeded Pods`
4. If the scheduler cannot allocate enough resources to the job, the pods stay pending; and the resources cannot be allocated to other jobs

That may leave resources (less than the job's resource request) idle for a while, e.g. for a huge job. The solution, e.g. backfilling other smaller jobs to improve resource utilization, will be proposed in a coming release. In the `allocate` phase, only the pod's `NodeAffinity` takes effect; the other predicates/priorities will be included on-demand in a coming release.
Contributor

Backfilling sounds too optimistic; many k8s clusters may not have that many jobs to backfill.

Member Author

@k82cn k82cn Oct 16, 2018

it's only one of the solution options :)


### Admission Controller

If quota runs out in the middle of creating a group of pods, a few members of a `PodGroup` may be created, while the rest will be denied by the `ResourceQuota` admission controller. `.spec.TotalResource` is added in `PodGroup` to address this problem. When a `PodGroup` is created with `.spec.TotalResource`, that much quota is reserved for the group if there is quota available. Pods of the group use the already-reserved quota. By setting `.spec.TotalResource` properly, one can ensure that Pods of a `PodGroup` have enough quota at creation time.
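For concreteness, a sketch of one plausible shape for `.spec.TotalResource`; the exact definition is not spelled out here, so the use of core/v1 `ResourceList` is an assumption:

```go
package v1alpha1

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative only: one plausible shape for the quota-reservation field.
type PodGroupSpec struct {
	// MinMembers of the group (minMembers elsewhere in this KEP).
	MinMembers int32

	// TotalResource the whole group needs; this amount of quota would be
	// reserved up front if available.
	// +optional
	TotalResource v1.ResourceList
}

// Example: reserve quota for a group of 8 pods needing 2 CPU / 4Gi each.
var exampleSpec = PodGroupSpec{
	MinMembers: 8,
	TotalResource: v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("16"),
		v1.ResourceMemory: resource.MustParse("32Gi"),
	},
}
```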
Contributor

Do mention the changes required to the default quota controller to make this proposal work. Those will be changes to the core project and need to be acceptable to the other quota subsystem owners.

Contributor

@vishh vishh left a comment

Klaus, there are still some open questions & comments. Can you run through them once?

Signed-off-by: Da K. Ma <klaus1982.cn@gmail.com>
@justaugustus
Member

/kind kep

@liggitt
Member

liggitt commented Oct 18, 2018

just to verify, there are no in-tree changes proposed now, correct? this was a question in sig-arch while going through the features targeted at 1.13

@k82cn
Member Author

k82cn commented Oct 18, 2018

just to verify, there are no in-tree changes proposed now, correct?

@liggitt , yes, we do NOT have code changes in-tree for this feature in 1.13.

@bsalamat
Member

bsalamat commented Nov 6, 2018

/assign

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k82cn
Member Author

k82cn commented Dec 3, 2018

Moved to kubernetes/enhancements#639

@k82cn k82cn deleted the kep_583 branch December 3, 2018 01:54